{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "3. NLTK.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/Afanasyy/colab/blob/main/3_NLTK.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "NLTK - пакет библиотек для NLP.\n",
        "Данная библиотека поддерживает несколько языков в зависимости от задач.\n",
        "\n",
        "С помощью библиотеки можно выполнять различные задачи NLP: удаление стоп-слов, токенизация, лемматизация, стемминг и прочие задачи, которые мы рассматривали в рамках анализа библиотеки spaCy. В данном ноутбуке будут представлены задачи, которые мы не рассматривали ранее.\n",
        "\n",
        "[Предобработка текста на русском языке.](https://medium.com/@bigdataschool/%D0%BF%D1%80%D0%B5%D0%B4%D0%BE%D0%B1%D1%80%D0%B0%D0%B1%D0%BE%D1%82%D0%BA%D0%B0-%D1%82%D0%B5%D0%BA%D1%81%D1%82%D0%B0-%D0%B2-nlp-82c164bb7416)"
      ],
      "metadata": {
        "id": "SSvQMxEPO76H"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import nltk"
      ],
      "metadata": {
        "id": "hsN0Xj9AT-qP"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Стемминг\n",
        "\n",
        "Стемминг - выделение основы слова (удаление окончаний). Не всегда совпадает с корнем слова.\n",
        "\n",
        "SnowballStemmer поддерживает русский язык."
      ],
      "metadata": {
        "id": "ptSLXgXVT_iW"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "s2qDOVhiM9fA"
      },
      "outputs": [],
      "source": [
        "from nltk.stem import SnowballStemmer"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "snowball = SnowballStemmer(language=\"russian\")"
      ],
      "metadata": {
        "id": "IBfzDGNOWIMK"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "snowball.stem(\"Хороший\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "ElElOEE1WNjF",
        "outputId": "176d0448-3c2d-461c-a606-2b76b437df5a"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'хорош'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 4
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "snowball.stem(\"Хорошая\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "Ie_qbPEdWdgI",
        "outputId": "38e22492-47a8-4151-d0c0-1110c3e308d8"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'хорош'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 5
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Словарь частот\n",
        "Получение словаря со словами и их частотностью в тексте.\n"
      ],
      "metadata": {
        "id": "NuJji5uSsOcE"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "В NLTK возможно получить несколько текстов из существующих корпусов текстов:"
      ],
      "metadata": {
        "id": "7keWkZ7NtJ3O"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "nltk.download(\"book\")\n",
        "from nltk.book import *"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "2Q6ShTRAs2Gp",
        "outputId": "21a5b334-ea5d-447b-cda1-4edcc9c8b58a"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[nltk_data] Downloading collection 'book'\n",
            "[nltk_data]    | \n",
            "[nltk_data]    | Downloading package abc to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/abc.zip.\n",
            "[nltk_data]    | Downloading package brown to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/brown.zip.\n",
            "[nltk_data]    | Downloading package chat80 to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/chat80.zip.\n",
            "[nltk_data]    | Downloading package cmudict to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/cmudict.zip.\n",
            "[nltk_data]    | Downloading package conll2000 to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/conll2000.zip.\n",
            "[nltk_data]    | Downloading package conll2002 to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/conll2002.zip.\n",
            "[nltk_data]    | Downloading package dependency_treebank to\n",
            "[nltk_data]    |     /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/dependency_treebank.zip.\n",
            "[nltk_data]    | Downloading package genesis to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/genesis.zip.\n",
            "[nltk_data]    | Downloading package gutenberg to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/gutenberg.zip.\n",
            "[nltk_data]    | Downloading package ieer to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/ieer.zip.\n",
            "[nltk_data]    | Downloading package inaugural to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/inaugural.zip.\n",
            "[nltk_data]    | Downloading package movie_reviews to\n",
            "[nltk_data]    |     /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/movie_reviews.zip.\n",
            "[nltk_data]    | Downloading package nps_chat to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/nps_chat.zip.\n",
            "[nltk_data]    | Downloading package names to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/names.zip.\n",
            "[nltk_data]    | Downloading package ppattach to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/ppattach.zip.\n",
            "[nltk_data]    | Downloading package reuters to /root/nltk_data...\n",
            "[nltk_data]    | Downloading package senseval to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/senseval.zip.\n",
            "[nltk_data]    | Downloading package state_union to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/state_union.zip.\n",
            "[nltk_data]    | Downloading package stopwords to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/stopwords.zip.\n",
            "[nltk_data]    | Downloading package swadesh to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/swadesh.zip.\n",
            "[nltk_data]    | Downloading package timit to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/timit.zip.\n",
            "[nltk_data]    | Downloading package treebank to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/treebank.zip.\n",
            "[nltk_data]    | Downloading package toolbox to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/toolbox.zip.\n",
            "[nltk_data]    | Downloading package udhr to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/udhr.zip.\n",
            "[nltk_data]    | Downloading package udhr2 to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/udhr2.zip.\n",
            "[nltk_data]    | Downloading package unicode_samples to\n",
            "[nltk_data]    |     /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/unicode_samples.zip.\n",
            "[nltk_data]    | Downloading package webtext to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/webtext.zip.\n",
            "[nltk_data]    | Downloading package wordnet to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/wordnet.zip.\n",
            "[nltk_data]    | Downloading package wordnet_ic to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/wordnet_ic.zip.\n",
            "[nltk_data]    | Downloading package words to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/words.zip.\n",
            "[nltk_data]    | Downloading package maxent_treebank_pos_tagger to\n",
            "[nltk_data]    |     /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping taggers/maxent_treebank_pos_tagger.zip.\n",
            "[nltk_data]    | Downloading package maxent_ne_chunker to\n",
            "[nltk_data]    |     /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping chunkers/maxent_ne_chunker.zip.\n",
            "[nltk_data]    | Downloading package universal_tagset to\n",
            "[nltk_data]    |     /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping taggers/universal_tagset.zip.\n",
            "[nltk_data]    | Downloading package punkt to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping tokenizers/punkt.zip.\n",
            "[nltk_data]    | Downloading package book_grammars to\n",
            "[nltk_data]    |     /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping grammars/book_grammars.zip.\n",
            "[nltk_data]    | Downloading package city_database to\n",
            "[nltk_data]    |     /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping corpora/city_database.zip.\n",
            "[nltk_data]    | Downloading package tagsets to /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping help/tagsets.zip.\n",
            "[nltk_data]    | Downloading package panlex_swadesh to\n",
            "[nltk_data]    |     /root/nltk_data...\n",
            "[nltk_data]    | Downloading package averaged_perceptron_tagger to\n",
            "[nltk_data]    |     /root/nltk_data...\n",
            "[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.\n",
            "[nltk_data]    | \n",
            "[nltk_data]  Done downloading collection book\n",
            "*** Introductory Examples for the NLTK Book ***\n",
            "Loading text1, ..., text9 and sent1, ..., sent9\n",
            "Type the name of the text or sentence to view it.\n",
            "Type: 'texts()' or 'sents()' to list the materials.\n",
            "text1: Moby Dick by Herman Melville 1851\n",
            "text2: Sense and Sensibility by Jane Austen 1811\n",
            "text3: The Book of Genesis\n",
            "text4: Inaugural Address Corpus\n",
            "text5: Chat Corpus\n",
            "text6: Monty Python and the Holy Grail\n",
            "text7: Wall Street Journal\n",
            "text8: Personals Corpus\n",
            "text9: The Man Who Was Thursday by G . K . Chesterton 1908\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Получение словаря частот:"
      ],
      "metadata": {
        "id": "vO1mLmJgtRAH"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from nltk import FreqDist\n",
        "\n",
        "frequency_distribution = FreqDist(text6) #text6 - текст из корпуса\n",
        "frequency_distribution.most_common(20)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "5sBif2EitPc4",
        "outputId": "0fef3fae-b285-4432-9fe0-8b4e40af3d4b"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[(':', 1197),\n",
              " ('.', 816),\n",
              " ('!', 801),\n",
              " (',', 731),\n",
              " (\"'\", 421),\n",
              " ('[', 319),\n",
              " (']', 312),\n",
              " ('the', 299),\n",
              " ('I', 255),\n",
              " ('ARTHUR', 225),\n",
              " ('?', 207),\n",
              " ('you', 204),\n",
              " ('a', 188),\n",
              " ('of', 158),\n",
              " ('--', 148),\n",
              " ('to', 144),\n",
              " ('s', 141),\n",
              " ('and', 135),\n",
              " ('#', 127),\n",
              " ('...', 118)]"
            ]
          },
          "metadata": {},
          "execution_count": 21
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# N-grams\n",
        "Ngrams -  последовательность из n элементов. В NLP в качестве элементов чаще всего используют токены, хотя могут быть использованы и другие элементы."
      ],
      "metadata": {
        "id": "gQc2xS2tWj2g"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from nltk import ngrams\n",
        "\n",
        "sentence = \"I really like python, it's pretty awesome.\"\n",
        "n_grams = ngrams(sentence.split(), 3) #можно брать не только 3-граммы; n может быть равен любому другому числу, меньшее или равное длине текста.\n",
        "\n",
        "for i in n_grams:\n",
        "  print(i)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "5y8c8DbylOL_",
        "outputId": "072868f4-95dc-47b0-9682-698c862a1d6f"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "('I', 'really', 'like')\n",
            "('really', 'like', 'python,')\n",
            "('like', 'python,', \"it's\")\n",
            "('python,', \"it's\", 'pretty')\n",
            "(\"it's\", 'pretty', 'awesome.')\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Перед нахождением n-gram можно также предобработать текст."
      ],
      "metadata": {
        "id": "N5c-3dg_lvLw"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Коллокация\n",
        "Коллокация - словосочетания, наиболее часто встречающиеся вместе и имеющие целостность. (Пример из википедии: например, *ставить условия* — выбор глагола ставить определяется традицией и зависит от существительного условия, при слове *предложение* будет другой глагол — *вносить*)\n"
      ],
      "metadata": {
        "id": "2YHMuyRUxOyp"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "text8.collocations() #коллокации в загруженном тексте 8"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "vuEkOW3kyPP-",
        "outputId": "6c17dc68-1be5-44ea-8bbf-c736bd920730"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "would like; medium build; social drinker; quiet nights; non smoker;\n",
            "long term; age open; Would like; easy going; financially secure; fun\n",
            "times; similar interests; Age open; weekends away; poss rship; well\n",
            "presented; never married; single mum; permanent relationship; slim\n",
            "build\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Для нахождения частотных сочетаний слов, которые могут находиться в тексте в разных формах, можно применить лемматизацию."
      ],
      "metadata": {
        "id": "zl0aMUsfygkG"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Конкорданс (concordance)\n",
        "\n",
        "Конкорданс в NLP - каждое появление слова в тексте с его окружением."
      ],
      "metadata": {
        "id": "cpK0xYJx2Pb3"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "text8.concordance(\"man\") #слово \"man\" в тексте 8"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "IDyWEx_z2fpo",
        "outputId": "cf505d73-459b-452d-9a42-dba3fc610af6"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Displaying 14 of 14 matches:\n",
            " to hearing from you all . ABLE young man seeks , sexy older women . Phone for \n",
            "ble relationship . GENUINE ATTRACTIVE MAN 40 y . o ., no ties , secure , 5 ft .\n",
            "ship , and quality times . VIETNAMESE MAN Single , never married , financially \n",
            "ip . WELL DRESSED emotionally healthy man 37 like to meet full figured woman fo\n",
            " nth subs LIKE TO BE MISTRESS of YOUR MAN like to be treated well . Bold DTE no\n",
            "eeks lady in similar position MARRIED MAN 50 , attrac . fit , seeks lady 40 - 5\n",
            "eks nice girl 25 - 30 serious rship . Man 46 attractive fit , assertive , and k\n",
            " 40 - 50 sought by Aussie mid 40s b / man f / ship r / ship LOVE to meet widowe\n",
            "discreet times . Sth E Subs . MARRIED MAN 42yo 6ft , fit , seeks Lady for discr\n",
            "woman , seeks professional , employed man , with interests in theatre , dining \n",
            " tall and of large build seeks a good man . I am a nonsmoker , social drinker ,\n",
            "lead to relationship . SEEKING HONEST MAN I am 41 y . o ., 5 ft . 4 , med . bui\n",
            " quiet times . Seeks 35 - 45 , honest man with good SOH & similar interests , f\n",
            " genuine , caring , honest and normal man for fship , poss rship . S / S , S / \n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Тональность текста\n",
        "\n",
        "Выявление эмоциональной окрашенности текста (положительный/нейтральный/отрицательный)."
      ],
      "metadata": {
        "id": "OtaM1c_RlNQ2"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import pandas as pd"
      ],
      "metadata": {
        "id": "SnS9p9f0pa8_"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "from nltk.sentiment.vader import SentimentIntensityAnalyzer\n",
        "nltk.download('vader_lexicon')"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "a2RCXPUGqb0s",
        "outputId": "04ffc519-8815-49aa-ee32-242c41d016ff"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[nltk_data] Downloading package vader_lexicon to /root/nltk_data...\n",
            "[nltk_data]   Package vader_lexicon is already up-to-date!\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "True"
            ]
          },
          "metadata": {},
          "execution_count": 16
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "sentence = \"The movie was too good.\"\n",
        "sid = SentimentIntensityAnalyzer()\n",
        "ss = sid.polarity_scores(sentence)\n",
        "print(sentence, ss)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "I0yH7knXqdas",
        "outputId": "87b6354f-9f10-48e5-bcf6-90b7a11e7d97"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "The movie was too good. {'neg': 0.0, 'neu': 0.58, 'pos': 0.42, 'compound': 0.4404}\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Для примера берется датасет с репликами персонажей из мультфильма Аватар (avatar.csv), и для каждой реплики применяется анализ тональности с помощью VADER (Valence Aware Dictionary for Sentiment Reasoning)."
      ],
      "metadata": {
        "id": "hujAioedouRJ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# просто анализ датасета\n",
        "df_avatar = pd.read_csv('avatar.csv', encoding= 'unicode_escape')\n",
        "df_avatar_lines = df_avatar.groupby('character').count()\n",
        "df_avatar_lines = df_avatar_lines.sort_values(by=['character_words'], ascending=False)[:10]\n",
        "top_character_names = df_avatar_lines.index.values\n",
        "\n",
        "# убрали несущественных персонажей\n",
        "df_character_sentiment = df_avatar[df_avatar['character'].isin(top_character_names)]\n",
        "df_character_sentiment = df_character_sentiment[['character', 'character_words']]\n",
        "\n",
        "# высчитывается оценка тональности\n",
        "sid = SentimentIntensityAnalyzer()\n",
        "df_character_sentiment.reset_index(inplace=True, drop=True)\n",
        "df_character_sentiment[['neg', 'neu', 'pos', 'compound']] = df_character_sentiment['character_words'].apply(sid.polarity_scores).apply(pd.Series)\n",
        "df_character_sentiment"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 441
        },
        "id": "x6B8_nXIokRg",
        "outputId": "20134b1d-157c-4cba-9e0f-5679e102871d"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[nltk_data] Downloading package vader_lexicon to /root/nltk_data...\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "     character                                    character_words    neg  \\\n",
              "0       Katara  Water. Earth. Fire. Air. My grandmother used t...  0.130   \n",
              "1        Sokka  It's not getting away from me this time.  Watc...  0.000   \n",
              "2       Katara                                       Sokka, look!  0.000   \n",
              "3        Sokka  Sshh! Katara, you're going to scare it away.  ...  0.200   \n",
              "4       Katara                          But, Sokka! I caught one!  0.000   \n",
              "...        ...                                                ...    ...   \n",
              "7053      Zuko  At least you don't look like a boar-q-pine! My...  0.183   \n",
              "7054      Suki              And why did you paint me firebending?  0.000   \n",
              "7055     Sokka  I thought it looked more exciting that way.  O...  0.000   \n",
              "7056      Iroh  Hey, my belly's not that big anymore. I've rea...  0.000   \n",
              "7057      Toph                 Well I think you all look perfect!  0.000   \n",
              "\n",
              "        neu    pos  compound  \n",
              "0     0.804  0.066   -0.6874  \n",
              "1     1.000  0.000    0.0000  \n",
              "2     1.000  0.000    0.0000  \n",
              "3     0.800  0.000   -0.5411  \n",
              "4     1.000  0.000    0.0000  \n",
              "...     ...    ...       ...  \n",
              "7053  0.817  0.000   -0.4007  \n",
              "7054  1.000  0.000    0.0000  \n",
              "7055  0.687  0.313    0.7501  \n",
              "7056  1.000  0.000    0.0000  \n",
              "7057  0.396  0.604    0.7263  \n",
              "\n",
              "[7058 rows x 6 columns]"
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-7cc54b5a-0d2d-4adc-9240-0cf34a117c8c\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>character</th>\n",
              "      <th>character_words</th>\n",
              "      <th>neg</th>\n",
              "      <th>neu</th>\n",
              "      <th>pos</th>\n",
              "      <th>compound</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Katara</td>\n",
              "      <td>Water. Earth. Fire. Air. My grandmother used t...</td>\n",
              "      <td>0.130</td>\n",
              "      <td>0.804</td>\n",
              "      <td>0.066</td>\n",
              "      <td>-0.6874</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Sokka</td>\n",
              "      <td>It's not getting away from me this time.  Watc...</td>\n",
              "      <td>0.000</td>\n",
              "      <td>1.000</td>\n",
              "      <td>0.000</td>\n",
              "      <td>0.0000</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Katara</td>\n",
              "      <td>Sokka, look!</td>\n",
              "      <td>0.000</td>\n",
              "      <td>1.000</td>\n",
              "      <td>0.000</td>\n",
              "      <td>0.0000</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Sokka</td>\n",
              "      <td>Sshh! Katara, you're going to scare it away.  ...</td>\n",
              "      <td>0.200</td>\n",
              "      <td>0.800</td>\n",
              "      <td>0.000</td>\n",
              "      <td>-0.5411</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Katara</td>\n",
              "      <td>But, Sokka! I caught one!</td>\n",
              "      <td>0.000</td>\n",
              "      <td>1.000</td>\n",
              "      <td>0.000</td>\n",
              "      <td>0.0000</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>...</th>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7053</th>\n",
              "      <td>Zuko</td>\n",
              "      <td>At least you don't look like a boar-q-pine! My...</td>\n",
              "      <td>0.183</td>\n",
              "      <td>0.817</td>\n",
              "      <td>0.000</td>\n",
              "      <td>-0.4007</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7054</th>\n",
              "      <td>Suki</td>\n",
              "      <td>And why did you paint me firebending?</td>\n",
              "      <td>0.000</td>\n",
              "      <td>1.000</td>\n",
              "      <td>0.000</td>\n",
              "      <td>0.0000</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7055</th>\n",
              "      <td>Sokka</td>\n",
              "      <td>I thought it looked more exciting that way.  O...</td>\n",
              "      <td>0.000</td>\n",
              "      <td>0.687</td>\n",
              "      <td>0.313</td>\n",
              "      <td>0.7501</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7056</th>\n",
              "      <td>Iroh</td>\n",
              "      <td>Hey, my belly's not that big anymore. I've rea...</td>\n",
              "      <td>0.000</td>\n",
              "      <td>1.000</td>\n",
              "      <td>0.000</td>\n",
              "      <td>0.0000</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7057</th>\n",
              "      <td>Toph</td>\n",
              "      <td>Well I think you all look perfect!</td>\n",
              "      <td>0.000</td>\n",
              "      <td>0.396</td>\n",
              "      <td>0.604</td>\n",
              "      <td>0.7263</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>7058 rows × 6 columns</p>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-7cc54b5a-0d2d-4adc-9240-0cf34a117c8c')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-7cc54b5a-0d2d-4adc-9240-0cf34a117c8c button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-7cc54b5a-0d2d-4adc-9240-0cf34a117c8c');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 15
        }
      ]
    }
  ]
}